1. Problem statement

  1. Build a classification model to predict if the customer is going to churn or not
  2. Optimize the model using appropriate techniques
  3. Generate a set of insights and recommendations that will help the bank

Import libraries

2. Load data

3. Exploratory Data Analysis

3.1 shape and data type

Insights:

  1. No column has missing values, but a few columns contain the value unknown, which we may have to treat as a missing value.
  2. We can convert the object-type columns to categories.

Fixing the data types
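The conversion of object-type columns to categories could look roughly like the sketch below. The tiny DataFrame and its values are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical miniature of the churn data; the real notebook loads the full data set.
df = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Card_Category": ["Blue", "Gold", "Blue"],
    "Customer_Age": [45, 51, 38],
})

# Columns holding text labels (listed by hand here, as an assumption).
cat_cols = ["Gender", "Card_Category"]

# Convert each listed column to the pandas 'category' dtype.
df = df.astype({col: "category" for col in cat_cols})

print(df.dtypes)
```

Categorical columns use less memory and make it explicit which variables are labels rather than measurements.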

3.2 Basic summary statistics and consequences

Central Tendency of data

Skewness
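As a minimal sketch, skewness per numeric column can be computed with pandas. The values below are illustrative: the minimum and maximum of each column match the summary in this section, but the values in between are made up.

```python
import pandas as pd

# Illustrative values only (assumption); not the real data.
df = pd.DataFrame({
    "Credit_Limit": [1438.3, 2500.0, 3200.0, 4100.0, 34516.0],
    "Total_Revolving_Bal": [0, 1800, 2100, 2300, 2517],
})

# Positive values indicate a long right tail, negative values a long left tail.
skewness = df.skew(numeric_only=True)
print(skewness)

# Comparing mean and median gives the same hint: mean > median suggests positive skew.
print(df.mean() - df.median())
```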

Insights:

  1. CLIENTNUM: can be dropped, as it is simply an ID.
  2. Customer_Age: minimum age is 23 and maximum age is 73 with mean 46.3; negatively skewed.
  3. Dependent_count: minimum dependent count is 0 and maximum is 5. This is a categorical variable.
  4. Months_on_book: minimum is 13 and maximum is 56 with mean 35.92; slightly negatively skewed.
  5. Total_Relationship_Count: minimum is 1 and maximum is 6 with mean 3.8. The values themselves are integers (a mean of integer counts need not be an integer). This variable is negatively skewed and is a categorical variable.
  6. Months_Inactive_12_mon: minimum is 0 and maximum is 6. This column has positive skewness.
  7. Contacts_Count_12_mon: minimum is 0 and maximum is 6. This column seems positively skewed.
  8. Credit_Limit: minimum is 1438.3 and maximum is 34516. This column has positive skewness and seems to have outliers.
  9. Total_Revolving_Bal: minimum is 0 and maximum is 2517. This column definitely has negative skewness.
  10. Avg_Open_To_Buy: minimum is 3 and maximum is 34516. This column is positively skewed and definitely has outliers.
  11. Total_Amt_Chng_Q4_Q1: minimum is 0 and maximum is 3.39. It is positively skewed.
  12. Total_Trans_Amt: minimum is 510.0 and maximum is 18484.0. This is positively skewed.
  13. Total_Trans_Ct: minimum is 10 and maximum is 139. This is positively skewed.
  14. Total_Ct_Chng_Q4_Q1: minimum is 0 and maximum is 3.714. This is positively skewed.
  15. Avg_Utilization_Ratio: minimum is 0 and maximum is 0.99. This is positively skewed.

Insights:

  1. Attrition_Flag: almost 85% of customers are existing customers.
  2. Gender: the percentage of female customers is higher than the percentage of male customers.
  3. Education_Level: the count of graduate customers is highest.
  4. Marital_Status: the count of married customers is highest.
  5. Income_Category: the count of customers in the salary category 'Less than $40K' is highest.
  6. Card_Category: the most frequent card category is Blue.

Checking categories for categorical variables

Attrition_Flag is imbalanced, which makes this an imbalanced dataset. We can use upsampling to balance it.
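A quick way to quantify the imbalance is to look at class shares. The sketch below uses a made-up series with roughly the proportions reported in this notebook; the label string "Attrited Customer" is an assumption.

```python
import pandas as pd

# Made-up target with an 84/16 split (assumption, close to the shares reported here).
flags = pd.Series(
    ["Existing Customer"] * 84 + ["Attrited Customer"] * 16,
    name="Attrition_Flag",
)

# normalize=True returns fractions instead of raw counts.
shares = flags.value_counts(normalize=True)
print(shares)
```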

3.3 Preprocessing of Data

3.4 Univariate and Bivariate Analysis

3.4.a Univariate plots for continuous variables

plots for Customer_Age

Insights:

Customer_Age: There are some outliers at ages 70 and 73, and the distribution is negatively skewed.

Plot for Dependent_count

Insights:

Dependent_count: There are no outliers. The highest density is at a dependent count of 3. It is negatively skewed.

Plot for Months_on_book

Insights:

Months_on_book: It has outliers and is negatively skewed. The median for this variable is 36.0.

Plot for Total_Relationship_Count

Insights:

Total_Relationship_Count: It has no outliers and is negatively skewed. The median is 4.

Plot for Months_Inactive_12_mon

Insights:

Months_Inactive_12_mon : There are some outliers at 0, 5 and 6. It is highly positively skewed. Median is at 2.

plot for Contacts_Count_12_mon

Insights:

Contacts_Count_12_mon: It has outliers at 0, 5 and 6 and has positive skewness. The median is 2.

plot for Credit_Limit

Insights:

Credit_Limit : it has too many outliers and is highly positively skewed.

plot for Total_Revolving_Bal

Insights:

Total_Revolving_Bal: It has no outliers but is negatively skewed. Its curve is not a proper bell-shaped distribution.

plot for Avg_Open_To_Buy

Insights:

Avg_Open_To_Buy: It has too many outliers and is highly positively skewed.

plot for Total_Amt_Chng_Q4_Q1

Insights:

Total_Amt_Chng_Q4_Q1: It has too many outliers and is highly positively skewed.

plot for Total_Trans_Amt

Insights:

Total_Trans_Amt: It has too many outliers and is highly positively skewed.

plot for Total_Trans_Ct

Insights:

Total_Trans_Ct: It has outliers and is positively skewed.

plot for Total_Ct_Chng_Q4_Q1

Insights:

Total_Ct_Chng_Q4_Q1: It has too many outliers and is highly positively skewed.

plot for Avg_Utilization_Ratio

Insights:

Avg_Utilization_Ratio: It has no outliers but is positively skewed.

3.4.b Univariate plots for categorical variables

Plot for Attrition_Flag

Plot for Gender

Plot for Education_Level

Plot for Marital_Status

Plot for Income_Category

Plot for Card_Category

Plot for Dependent_count

Plot for Total_Relationship_Count

Plot for Months_Inactive_12_mon

plot for Contacts_Count_12_mon

3.4.c Bivariate Plots

1. pair plot

plotting relationship of continuous variables with each other and Attrition_Flag variable

2. HeatMap

3. Line Plot

Total_Trans_Amt and Total_Trans_Ct are highly correlated; let's check whether they have a linear relationship.

Months_on_book and Customer_Age are highly correlated; let's check whether they have a linear relationship.

Months_on_book and Customer_Age have a linear relationship, so we can drop one of them. We will drop Months_on_book.
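The check-then-drop step above can be sketched as follows. The data is synthetic and the 0.7 correlation threshold is an assumption chosen for illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): tenure that grows roughly linearly with age.
rng = np.random.default_rng(0)
age = rng.integers(26, 74, size=200)
months = age - 10 + rng.normal(0, 3, size=200)
df = pd.DataFrame({"Customer_Age": age, "Months_on_book": months})

# Pearson correlation measures the strength of the linear relationship.
corr = df["Customer_Age"].corr(df["Months_on_book"])
print(f"Pearson correlation: {corr:.2f}")

# Drop one of the two columns when they carry nearly the same information.
THRESHOLD = 0.7  # hypothetical cut-off
if abs(corr) > THRESHOLD:
    df = df.drop(columns=["Months_on_book"])

print(df.columns.tolist())
```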

4. Stacked Plot :

Plotting relationship of customer's information with Attrition_Flag

5. Box Plot

Bivariate plot between continuous variables and Attrition_Flag

6. Line plot

plot between Avg_Utilization_Ratio, Customer_Age and Attrition_Flag

Plot between Total_Trans_Amt, Customer_Age and Attrition_Flag

Plot between Total_Revolving_Bal, Customer_Age and Attrition_Flag

Plot between Credit_Limit, Customer_Age and Attrition_Flag

7. Category Plot

Plot of Dependent_count, Customer_Age and Education_Level with Attrition_Flag

Plot of Card_Category, Customer_Age and Total_Relationship_Count with Attrition_Flag

Plot of Months_Inactive_12_mon, Total_Revolving_Bal and Total_Relationship_Count with Attrition_Flag

  1. 6 products and 0, 5, 6 inactive months
  2. 4 products and 0 inactive months

4. Insights based on EDA

  1. Only 16% of customers attrited.
  2. 30.9% of customers are graduates, followed by 19.9% whose highest education is high school.
  3. Almost 93% of customers have the Blue card category.
  4. The data set is non-linear.
  5. Credit_Limit has a linear relationship with Avg_Open_To_Buy, so we can drop Avg_Open_To_Buy.
  6. Months_on_book and Customer_Age have a linear relationship, so we can drop one of them. We will drop Months_on_book.

Card_Category:

  1. Customers with a Platinum card have the maximum attrition count, but most of the attrition is among customers with 3 products.
  2. Attrited customers with a Gold card and 1 product are aged 40-50, while those with a Gold card and 3 products are aged 35-55.

Months_Inactive_12_mon:

  1. All of the customers with 1 product and 0 inactive months attrited.
  2. There is no attrition for customers with 4 or 6 products and 0 inactive months.

  1. Customers with a higher contact count in the last 12 months have a higher attrition count. 100% of customers with 6 contacts attrited.
  2. The attrition rate is highest for customers with 0 inactive months.
  3. Customer attrition is higher for low relationship counts (1, 2).
  4. Customers aged 30-60 with Avg_Utilization_Ratio less than 0.2 attrited.
  5. Customers aged 35-60 with Total_Trans_Amt less than 4000 attrited.
  6. Customers aged 30-70 with Total_Revolving_Bal less than 1000 attrited.

5. Data Pre-processing

5.1 Label Encoding:

Label encoding of Gender, Education_Level, Marital_Status, Income_Category and Card_Category
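One way to label-encode these columns is scikit-learn's LabelEncoder, applied column by column. This is a sketch on made-up rows; whether the notebook used LabelEncoder or another mapping is not stated here, so treat the choice as an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows (assumption); only two of the five columns are shown.
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M"],
    "Card_Category": ["Blue", "Gold", "Blue", "Silver"],
})

# Keep one fitted encoder per column so the codes can be mapped back later.
encoders = {}
for col in ["Gender", "Card_Category"]:
    encoder = LabelEncoder()
    df[col] = encoder.fit_transform(df[col])
    encoders[col] = encoder

print(df)
print(list(encoders["Card_Category"].classes_))
```

LabelEncoder assigns codes in sorted label order, so the integers carry no business meaning. For ordered categories such as income bands, an explicit mapping may be preferable.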

5.2 Split the data into train and test sets
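Because the target is imbalanced, a stratified split keeps the attrition share similar in both sets. The sketch below uses a made-up target with 16% positives; the 70/30 split ratio and the random seed are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data (assumption): 100 rows, 16 attrited (label 1), 84 existing (label 0).
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([1] * 16 + [0] * 84, name="Attrition_Flag")

# stratify=y preserves the class proportions in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

print(len(X_train), len(X_test))
print(round(y_train.mean(), 3), round(y_test.mean(), 3))
```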

5.3 Missing Value Treatment

Treating Missing value of Income_Category

All of the missing values have been treated.
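The imputation method is not spelled out here, so the following is only one possible approach: mark the placeholder as missing, then fill it with the most frequent category learned from the training set. The placeholder spelling "Unknown", the helper name `unknown_to_nan` and the sample rows are assumptions.

```python
import pandas as pd

# Made-up train/test rows (assumption).
train = pd.DataFrame(
    {"Income_Category": ["Less than $40K", "Unknown", "Less than $40K", "$60K - $80K"]}
)
test = pd.DataFrame({"Income_Category": ["Unknown", "$60K - $80K"]})


def unknown_to_nan(series):
    """Replace the placeholder label with NaN so pandas treats it as missing."""
    return series.mask(series == "Unknown")


train["Income_Category"] = unknown_to_nan(train["Income_Category"])
test["Income_Category"] = unknown_to_nan(test["Income_Category"])

# Learn the fill value from the training set only, then apply it to both sets.
mode_value = train["Income_Category"].mode()[0]
train["Income_Category"] = train["Income_Category"].fillna(mode_value)
test["Income_Category"] = test["Income_Category"].fillna(mode_value)

print(train["Income_Category"].isna().sum(), test["Income_Category"].isna().sum())
```

Learning the fill value after the train/test split avoids leaking information from the test set into training.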

5.4 Outlier Treatment and Feature Engineering (scaling)

Log transformation of Total_Trans_Ct, Total_Trans_Amt, Credit_Limit and Customer_Age
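A log transformation compresses the long right tail of a positively skewed column. In the sketch below the minimum (510) and maximum (18484) match the summary statistics reported earlier, while the middle values are made up; the new column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative values only (assumption); not the real data.
df = pd.DataFrame({"Total_Trans_Amt": [510.0, 2000.0, 4500.0, 18484.0]})

skew_before = df["Total_Trans_Amt"].skew()

# Natural log is safe here because every value is strictly positive.
df["Total_Trans_Amt_log"] = np.log(df["Total_Trans_Amt"])

skew_after = df["Total_Trans_Amt_log"].skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

For columns that can contain zeros, `np.log1p` is a common alternative because it computes log(1 + x).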

Outlier treatment for Total_Ct_Chng_Q4_Q1, Total_Trans_Ct, Total_Trans_Amt, Total_Amt_Chng_Q4_Q1, Credit_Limit, Contacts_Count_12_mon, Months_Inactive_12_mon, Months_on_book and Customer_Age
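The exact outlier treatment is not described here. A common choice, shown below purely as an assumption, is to clip values to the 1.5 x IQR whiskers used by box plots. The helper name `clip_iqr` is hypothetical.

```python
import pandas as pd


def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up values with one obvious outlier.
values = pd.Series([1, 2, 3, 4, 100])
clipped = clip_iqr(values)
print(clipped.tolist())
```

In a train/test workflow the bounds would normally be computed on the training set and reused on the test set.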

Model Building

* Functions to show different metrics and the confusion matrix
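Helper functions of this kind might look like the sketch below. The function names `get_scores` and `confusion_table` are hypothetical; the metric functions are standard scikit-learn calls.

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


def get_scores(y_true, y_pred):
    """Return the main classification metrics as a dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


def confusion_table(y_true, y_pred):
    """Return the confusion matrix as a labelled DataFrame (rows = actual)."""
    matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return pd.DataFrame(
        matrix,
        index=["actual 0", "actual 1"],
        columns=["predicted 0", "predicted 1"],
    )


# Tiny made-up example: one attrited customer (label 1) is missed.
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 0, 1]
scores = get_scores(y_true, y_pred)
table = confusion_table(y_true, y_pred)
print(scores)
print(table)
```

For churn prediction, recall on the attrited class is often the metric to watch, since a missed churner is usually costlier than a false alarm.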

6. Logistic Regression with sampling and regularization

6.1 Building Logistic Regression model

6.2 Logistic regression improvement using upsampling and downsampling

Upsampling using SMOTE

Downsampling the large class (Label: 0)
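Random downsampling of the majority class can be sketched with plain pandas, as below. The data is made up, and sampling down to an exact 1:1 ratio is an assumption; other ratios are possible.

```python
import pandas as pd

# Made-up training set (assumption): label 0 is the large class.
train = pd.DataFrame({
    "feature": range(100),
    "Attrition_Flag": [0] * 84 + [1] * 16,
})

majority = train[train["Attrition_Flag"] == 0]
minority = train[train["Attrition_Flag"] == 1]

# Randomly keep as many majority rows as there are minority rows.
majority_down = majority.sample(n=len(minority), random_state=1)

# Recombine and shuffle so the classes are not grouped together.
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=1)
print(balanced["Attrition_Flag"].value_counts())
```

Resampling should be applied to the training set only, so that the test set keeps the real class distribution.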

6.3 Regularizing Logistic Regression model
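In scikit-learn, the strength of the L2 penalty in LogisticRegression is controlled by `C`, the inverse of the regularization strength. The sketch below compares a weakly and a strongly regularized model on synthetic data; the data and both `C` values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data (assumption), roughly 16% positives.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

# Smaller C means stronger regularization (default penalty is L2).
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
print(f"coefficient norm, C=100:  {weak_norm:.3f}")
print(f"coefficient norm, C=0.01: {strong_norm:.3f}")
```

Stronger regularization shrinks the coefficients toward zero, which can reduce overfitting at the cost of some flexibility.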

Upsampled model

Downsampling Logistic regression

7. Model building - Bagging and Boosting

7.1 Build Decision Tree Model

Confusion Matrix

7.2 Bagging Classifier

Bagging Classifier with weighted decision tree

7.3 Random Forest Classifier

Random forest with class weights

7.4 Ada Boost Model

7.5 Gradient Boosting Classifier

7.6 XGBoost Classifier

Comparing all the models - Model performance evaluation
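A comparison across the models in this section could be organized as in the sketch below. It runs on synthetic data with assumed hyperparameters, so the printed scores say nothing about the real results. XGBoost is left out because it needs a separate package.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption), roughly 16% positives.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.84, 0.16], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Bagging (weighted tree)": BaggingClassifier(
        DecisionTreeClassifier(class_weight="balanced", random_state=1),
        n_estimators=20,
        random_state=1,
    ),
    "Random forest (class weights)": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=1
    ),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
}

# Fit each model and record recall on both sets to spot overfitting.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        "model": name,
        "train_recall": recall_score(y_train, model.predict(X_train)),
        "test_recall": recall_score(y_test, model.predict(X_test)),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison)
```

A large gap between train and test recall is a sign of overfitting and a reason to tune or constrain the model.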

8. Tuning Models - Using GridSearch with pipeline
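Grid search over a pipeline can be sketched as follows. The scaler step, the parameter grid, the recall scoring and the synthetic data are all assumptions for illustration. Parameters of a pipeline step are addressed as `<step name>__<parameter>`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=1
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", AdaBoostClassifier(random_state=1)),
])

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation.
param_grid = {
    "model__n_estimators": [25, 50],
    "model__learning_rate": [0.5, 1.0],
}

grid_search = GridSearchCV(pipe, param_grid, scoring="recall", cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```

Wrapping preprocessing and model in one pipeline ensures each cross-validation fold fits the preprocessing on its own training part only.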

8.1 Tuning Bagging Classifier

8.2 Tuned AdaBoost Classifier

8.3 Tuned Gradient Boosting Classifier

9. Tuning Models - Using RandomSearch with pipeline
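Randomized search samples a fixed number of parameter combinations instead of trying every one. The sketch below uses scikit-learn's RandomizedSearchCV with assumed distributions, an assumed scoring metric and synthetic data.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=1
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=1)),
])

# Hypothetical distributions; randint(20, 60) draws integers from 20 to 59,
# uniform(0.6, 0.4) draws floats from the interval [0.6, 1.0].
param_distributions = {
    "model__n_estimators": randint(20, 60),
    "model__max_depth": randint(1, 4),
    "model__subsample": uniform(0.6, 0.4),
}

random_search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=5,          # number of sampled combinations
    scoring="recall",
    cv=3,
    random_state=1,
)
random_search.fit(X, y)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

With `n_iter` fixed, the cost stays constant no matter how large the search space is, which is the main trade-off against an exhaustive grid.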

9.1 Tuning Bagging Classifier

9.2 Tuned AdaBoost Classifier

9.3 Tuned Gradient Boosting Classifier

10. Model performance evaluation

Best Model:

Creating the tuned AdaBoost model using the best hyperparameters

11. Actionable insights and Recommendation for business